Conformal Credal Self-Supervised Learning
In semi-supervised learning, the paradigm of self-training refers to the idea
of learning from pseudo-labels suggested by the learner itself. Across various
domains, corresponding methods have proven effective and achieve
state-of-the-art performance. However, pseudo-labels typically stem from ad-hoc
heuristics, relying on the quality of the predictions without guaranteeing
their validity. One such method, so-called credal self-supervised
learning, maintains pseudo-supervision in the form of sets of (instead of
single) probability distributions over labels, thereby allowing for a flexible
yet uncertainty-aware labeling. Again, however, there is no justification
beyond empirical effectiveness. To address this deficiency, we make use of
conformal prediction, an approach that comes with guarantees on the validity of
set-valued predictions. As a result, the construction of credal sets of labels
is supported by a rigorous theoretical foundation, leading to better calibrated
and less error-prone supervision for unlabeled data. Along with this, we
present effective algorithms for learning from credal self-supervision. An
empirical study demonstrates excellent calibration properties of the
pseudo-supervision, as well as the competitiveness of our method on several
benchmark datasets.
Comment: 26 pages, 5 figures, 10 tables, to be published at the 12th Symposium
on Conformal and Probabilistic Prediction with Applications (COPA 2023)
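The set-valued construction at the heart of the approach can be illustrated with a minimal split-conformal sketch (the nonconformity score and function names below are illustrative assumptions, not the paper's exact method):

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split-conformal prediction sets for classification (minimal sketch).

    Nonconformity score: one minus the probability assigned to the true class.
    Under exchangeability, the returned set contains the true label with
    probability at least 1 - alpha (marginal coverage).
    """
    n = len(cal_labels)
    # Nonconformity scores on a held-out calibration split.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    # A label enters the prediction set if its score does not exceed q.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```

Such sets of candidate labels (equivalently, the credal sets they induce) then serve as pseudo-supervision in place of single hard or soft labels.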
Detecting Novelties with Empty Classes
For open world applications, deep neural networks (DNNs) need to be aware of
previously unseen data and adaptable to evolving environments. Furthermore, it
is desirable to detect and learn novel classes which are not included in the
DNN's underlying set of semantic classes in an unsupervised fashion. The method
proposed in this article builds upon anomaly detection to retrieve
out-of-distribution (OoD) data as candidates for new classes. We thereafter
extend the DNN by empty classes and fine-tune it on the OoD data samples.
To this end, we introduce two loss functions, which 1) entice the DNN to assign
OoD samples to the empty classes and 2) minimize the intra-class feature
distances between them. Thus, instead of ground truth containing labels for
the different novel classes, the DNN obtains a single OoD label together with a
distance matrix, which is computed in advance. We perform several experiments
for image classification and semantic segmentation, which demonstrate that a
DNN can extend its own semantic space by multiple classes without having access
to ground truth.
Comment: 13 pages, 13 figures, 4 tables
Memorization-Dilation: Modeling Neural Collapse Under Label Noise
The notion of neural collapse refers to several emergent phenomena that have
been empirically observed across various canonical classification problems.
During the terminal phase of training a deep neural network, the feature
embeddings of all examples of the same class tend to collapse to a single
representation, and the features of different classes tend to separate as much
as possible. Neural collapse is often studied through a simplified model,
called the unconstrained feature representation, in which the model is assumed
to have "infinite expressivity" and can map each data point to any arbitrary
representation. In this work, we propose a more realistic variant of the
unconstrained feature representation that takes the limited expressivity of the
network into account. Empirical evidence suggests that the memorization of
noisy data points leads to a degradation (dilation) of the neural collapse.
Using a model of the memorization-dilation (M-D) phenomenon, we show one
mechanism by which different losses lead to different performances of the
trained network on noisy data. Our proofs reveal why label smoothing, a
modification of cross-entropy empirically observed to produce a regularization
effect, leads to improved generalization in classification tasks.
Comment: to be published at ICLR 202
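The degree of collapse described above is commonly quantified by comparing within-class to between-class feature scatter; a minimal sketch of such a statistic (an illustrative proxy, not the paper's exact measure) is:

```python
import numpy as np

def within_class_variability(features, labels):
    """Ratio of within-class to between-class scatter of feature embeddings.
    Smaller values indicate stronger neural collapse: at full collapse, every
    example of a class sits exactly at its class mean and the ratio is zero."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    sw = 0.0  # within-class scatter
    sb = 0.0  # between-class scatter
    for c in classes:
        fc = features[labels == c]
        mu = fc.mean(axis=0)
        sw += ((fc - mu) ** 2).sum()
        sb += len(fc) * ((mu - global_mean) ** 2).sum()
    return sw / (sb + 1e-12)
```

Under label noise, memorized noisy points keep this ratio bounded away from zero, which is the dilation effect the M-D model captures.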
From Label Smoothing to Label Relaxation
Regularization of (deep) learning models can be realized at the model, loss, or data level. As a technique somewhere in-between loss and data, label smoothing turns deterministic class labels into probability distributions, for example by uniformly distributing a certain part of the probability mass over all classes. A predictive model is then trained on these distributions as targets, using cross-entropy as loss function. While this method has shown improved performance compared to non-smoothed cross-entropy, we argue that the use of a smoothed though still precise probability distribution as a target can be questioned from a theoretical perspective. As an alternative, we propose a generalized technique called label relaxation, in which the target is a set of probabilities represented in terms of an upper probability distribution. This leads to a genuine relaxation of the target instead of a distortion, thereby reducing the risk of incorporating an undesirable bias in the learning process. Methodologically, label relaxation leads to the minimization of a novel type of loss function, for which we propose a suitable closed-form expression for model optimization. The effectiveness of the approach is demonstrated in an empirical study on image data.
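The contrast between a smoothed target and a relaxed (set-valued) one can be sketched as follows; the projection step and divergence in the relaxation loss are assumptions for illustration, not the paper's exact closed form:

```python
import numpy as np

def label_smoothing_target(y, num_classes, eps=0.1):
    """Standard label smoothing: move eps of the probability mass
    uniformly onto all classes; the target stays a single, precise
    distribution."""
    t = np.full(num_classes, eps / num_classes)
    t[y] += 1.0 - eps
    return t

def label_relaxation_loss(probs, y, alpha=0.1):
    """Sketch of a relaxation-style loss: the target is the *set* of all
    distributions putting at least 1 - alpha mass on the true class y.
    A prediction inside that set incurs zero loss; otherwise we take a
    divergence to a projected member of the set (form assumed here)."""
    if probs[y] >= 1.0 - alpha:
        return 0.0
    # Project onto the boundary of the set: true class gets 1 - alpha,
    # the remaining alpha is spread proportionally to the predicted probs.
    target = probs * (alpha / (probs.sum() - probs[y] + 1e-12))
    target[y] = 1.0 - alpha
    # KL divergence from the projected target to the prediction.
    return float((target * np.log((target + 1e-12) / (probs + 1e-12))).sum())
```

The key difference: the smoothed target penalizes every prediction that deviates from one fixed distribution, whereas the relaxed target leaves an entire region of confident predictions unpenalized.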
Kronecker Decomposition for Knowledge Graph Embeddings
Knowledge graph embedding research has mainly focused on learning continuous
representations of entities and relations tailored towards the link prediction
problem. Recent results indicate an ever-increasing predictive ability of
current approaches on benchmark datasets. However, this effectiveness often
comes at the cost of over-parameterization and increased computational
complexity. The former necessitates extensive hyperparameter optimization to
mitigate overfitting. The latter magnifies the importance of winning
the hardware lottery. Here, we investigate a remedy for the first problem. We
propose a technique based on Kronecker decomposition to reduce the number of
parameters in a knowledge graph embedding model, while retaining its
expressiveness. Through Kronecker decomposition, large embedding matrices are
split into smaller embedding matrices during the training process. Hence,
embeddings of knowledge graphs are not plainly retrieved but reconstructed on
the fly. The decomposition ensures that elementwise interactions between three
embedding vectors are extended with interactions within each embedding vector.
This implicitly reduces redundancy in embedding vectors and encourages feature
reuse. To quantify the impact of applying Kronecker decomposition on embedding
matrices, we conduct a series of experiments on benchmark datasets. Our
experiments suggest that applying Kronecker decomposition on embedding matrices
leads to an improved parameter efficiency on all benchmark datasets. Moreover,
empirical evidence suggests that reconstructed embeddings entail robustness
against noise in the input knowledge graph. To foster reproducible research, we
provide an open-source implementation of our approach, including training and
evaluation scripts as well as pre-trained models in our knowledge graph
embedding framework (https://github.com/dice-group/dice-embeddings).
Comment: Accepted at HT 202
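The on-the-fly reconstruction described above can be sketched in a few lines (a minimal two-factor illustration; the factor shapes and function name are assumptions, not the exact configuration used in the paper):

```python
import numpy as np

def reconstruct_embedding(a, b):
    """Reconstruct a full embedding on the fly as the Kronecker product of
    two smaller stored factors: storing d1 + d2 parameters yields a
    d1 * d2 dimensional embedding. The Kronecker structure couples every
    entry of one factor with every entry of the other, inducing the
    within-vector interactions and feature reuse described above."""
    return np.kron(a, b)
```

For example, with two factors of dimension 8 each, an entity embedding of dimension 64 requires only 16 stored parameters instead of 64, at the cost of reconstructing the embedding in each forward pass.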